$$s^{(t)} = f(s^{(t-1)}; \theta)$$
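[Figure: the recurrence unfolded in time as a chain of states $\ldots, s^{(t-1)}, s^{(t)}, s^{(t+1)}, \ldots$, each obtained from the previous one by applying $f$.]

The same function $f$, with the same parameters $\theta$, maps the state at time $t$ to the state at time $t + 1$ at every step.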
With an external input $x^{(t)}$ driving the system:

$$s^{(t)} = f(s^{(t-1)}, x^{(t)}; \theta)$$
$$s^{(t)} = g^{(t)}(x^{(t)}, x^{(t-1)}, x^{(t-2)}, \ldots, x^{(2)}, x^{(1)})$$
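The relation between the one-step recurrence $f$ and the unfolded map $g^{(t)}$ can be illustrated with a small sketch. The scalar transition `f` and the parameter pair `theta` below are hypothetical choices made only for illustration; any transition function would do.

```python
from functools import reduce

def f(s, x, theta):
    """One step of the recurrence: new state from previous state and current input.
    The linear form a*s + b*x is an assumption made for this sketch."""
    a, b = theta
    return a * s + b * x

def g(xs, theta, s0=0.0):
    """Unfolded map g^{(t)}: fold f over the whole input sequence, reusing theta."""
    return reduce(lambda s, x: f(s, x, theta), xs, s0)

theta = (0.5, 1.0)
xs = [1.0, 2.0, 3.0]

# Step-by-step recurrence, for comparison with the unfolded form.
s = 0.0
for x in xs:
    s = f(s, x, theta)

print(s, g(xs, theta))  # both print the same state
```

The point of the sketch is that `g` accepts a sequence of any length while `f` always has the same input size and the same parameters.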
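The function $g^{(t)}$ takes the whole past sequence $(x^{(t)}, x^{(t-1)}, x^{(t-2)}, \ldots, x^{(2)}, x^{(1)})$ as input and produces the state $s^{(t)}$. A different $g^{(t)}$ is needed for each $t$, whereas the same transition function $f$, taking $s$ and $x$ as arguments, is applied at every step.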
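[Figure: a recurrent network with input $x$ and state $s$, drawn folded and unfolded. The unfolded graph shows inputs $x^{(t-1)}, x^{(t)}, x^{(t+1)}$ feeding states $s^{(t-1)}, s^{(t)}, s^{(t+1)}$; the same $f$ with parameters $\theta$ is applied at each step from $t$ to $t + 1$.]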
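[Figure: a recurrent network with input $x$, state $s$, and output $o$, drawn folded and unfolded over inputs $x_{t-1}, x_t, x_{t+1}$, states $s_{t-1}, s_t, s_{t+1}$, and outputs $o_{t-1}, o_t, o_{t+1}$. The weights are $U$ (input to state), $W$ (state to state), and $V$ (state to output).]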
$$a^{(t)} = b + W s^{(t-1)} + U x^{(t)}$$
$$s^{(t)} = \tanh(a^{(t)})$$
$$o^{(t)} = c + V s^{(t)}$$
$$p^{(t)} = \mathrm{softmax}(o^{(t)})$$
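One step of these forward propagation equations might be sketched as follows. The helper names `matvec`, `softmax`, and `rnn_step` and the numeric values are assumptions for illustration; matrices are plain nested lists.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(o):
    """Numerically stable softmax over a list of scores."""
    m = max(o)
    e = [math.exp(v - m) for v in o]
    z = sum(e)
    return [v / z for v in e]

def rnn_step(s_prev, x, U, V, W, b, c):
    """a = b + W s_prev + U x;  s = tanh(a);  o = c + V s;  p = softmax(o)."""
    a = [bi + ws + ux for bi, ws, ux in zip(b, matvec(W, s_prev), matvec(U, x))]
    s = [math.tanh(ai) for ai in a]
    o = [ci + vs for ci, vs in zip(c, matvec(V, s))]
    return s, softmax(o)

# Illustrative parameters: 2 inputs, 2 state units, 3 output classes.
U = [[0.5, -0.3], [0.1, 0.8]]
W = [[0.2, -0.4], [0.7, 0.1]]
V = [[1.0, -1.0], [0.5, 0.3], [-0.2, 0.9]]
b, c = [0.1, -0.2], [0.0, 0.1, -0.1]

s, p = rnn_step([0.0, 0.0], [1.0, 0.0], U, V, W, b, c)
print(s, p)
```

Calling `rnn_step` repeatedly, feeding each returned `s` back in as `s_prev`, unfolds the network over a sequence.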
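where the parameters are the bias vectors $b$ and $c$ and the weight matrices $U$, $V$, and $W$. For a training pair $(x, y)$, the total loss is the sum of the per-step losses:

$$L(x, y) = \sum_t L^{(t)} = \sum_t -\log p^{(t)}_{y^{(t)}},$$

where $y^{(t)}$ is the target category at time step $t$.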
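As a small worked instance of this loss, the sketch below sums the negative log-probabilities that the model assigns to the targets. The probability vectors and targets are made-up values, and `sequence_nll` is a hypothetical helper name.

```python
import math

def sequence_nll(probs, targets):
    """Sum over time steps of -log p^{(t)}[y^{(t)}]."""
    return sum(-math.log(p[y]) for p, y in zip(probs, targets))

# Two time steps, two classes (illustrative numbers only).
probs = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]

loss = sequence_nll(probs, targets)  # equals -log(0.5) - log(0.75)
print(loss)
```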
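[Figure: a recurrent network with weights $U$, $V$, $W$, drawn folded (nodes $x$, $h$, $o$, $y$, $L$) and unfolded over time steps $t-1$, $t$, $t+1$, with inputs $x^{(t)}$, hidden units $h^{(t)}$, outputs $o^{(t)}$, targets $y^{(t)}$, and losses $L^{(t)}$.]

[Figure: two panels labeled "Train time" and "Test time", each showing inputs $x^{(t-1)}, x^{(t)}$, hidden units $h^{(t-1)}, h^{(t)}$, and outputs $o^{(t-1)}, o^{(t)}$ with weights $U$, $V$, $W$; one panel also shows the targets $y^{(t-1)}, y^{(t)}$ and losses $L^{(t-1)}, L^{(t)}$.]

[Figure: a recurrent network that reads $x^{(t-1)}, x^{(t)}, x^{(t+1)}, \ldots, x^{(T)}$ through $U$ and $W$ and produces a single output $o^{(T)}$ through $V$ from the last hidden state $h^{(T)}$, with target $y^{(T)}$ and loss $L^{(T)}$.]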
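The nodes of the computational graph include the parameters $U$, $V$, $W$, $b$, $c$ and, for each time step $t$, the nodes $x^{(t)}$, $s^{(t)}$, $o^{(t)}$, and $L^{(t)}$. Starting from the nodes immediately preceding the final loss,

$$\frac{\partial L}{\partial L^{(t)}} = 1.$$

The gradient $\nabla_{o^{(t)}} L$ on the outputs at time step $t$ is, for all $i, t$,

$$(\nabla_{o^{(t)}} L)_i = \frac{\partial L}{\partial o_i^{(t)}} = \frac{\partial L}{\partial L^{(t)}} \frac{\partial L^{(t)}}{\partial o_i^{(t)}} = p_i^{(t)} - \mathbf{1}_{i, y^{(t)}}.$$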
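At the final time step $T$, the state $s^{(T)}$ has only $o^{(T)}$ as a descendant, so

$$\nabla_{s^{(T)}} L = \nabla_{o^{(T)}} L \, \frac{\partial o^{(T)}}{\partial s^{(T)}} = \nabla_{o^{(T)}} L \, V,$$

that is, component by component,

$$\frac{\partial L}{\partial s_j^{(T)}} = \sum_i \frac{\partial L}{\partial o_i^{(T)}} V_{ij}.$$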
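Iterating backward from $t = T - 1$ down to $t = 1$, the state $s^{(t)}$ (for $t < T$) has both $o^{(t)}$ and $s^{(t+1)}$ as descendants:

$$\nabla_{s^{(t)}} L = \nabla_{s^{(t+1)}} L \, \frac{\partial s^{(t+1)}}{\partial s^{(t)}} + \nabla_{o^{(t)}} L \, \frac{\partial o^{(t)}}{\partial s^{(t)}} = \nabla_{s^{(t+1)}} L \, \mathrm{diag}\big(1 - (s^{(t+1)})^2\big) W + \nabla_{o^{(t)}} L \, V,$$

where $\mathrm{diag}\big(1 - (s^{(t+1)})^2\big)$ is the diagonal matrix with entries $1 - (s_i^{(t+1)})^2$, the derivative of the hyperbolic tangent for hidden unit $i$ at time $t + 1$.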
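The gradients on the parameters sum the contributions of all time steps:

$$\nabla_c L = \sum_t \nabla_{o^{(t)}} L \, \frac{\partial o^{(t)}}{\partial c} = \sum_t \nabla_{o^{(t)}} L$$

$$\nabla_b L = \sum_t \nabla_{s^{(t)}} L \, \frac{\partial s^{(t)}}{\partial b} = \sum_t \nabla_{s^{(t)}} L \, \mathrm{diag}\big(1 - (s^{(t)})^2\big)$$

$$\nabla_V L = \sum_t \nabla_{o^{(t)}} L \, \frac{\partial o^{(t)}}{\partial V} = \sum_t \big(\nabla_{o^{(t)}} L\big)^\top s^{(t)\top}$$

$$\nabla_W L = \sum_t \nabla_{s^{(t)}} L \, \frac{\partial s^{(t)}}{\partial W} = \sum_t \Big(\nabla_{s^{(t)}} L \, \mathrm{diag}\big(1 - (s^{(t)})^2\big)\Big)^\top s^{(t-1)\top}$$

Gradients are written as row vectors and states as column vectors, so the last two expressions are outer products with the same shape as $V$ and $W$. Note that $\nabla_{s^{(t)}} L$ is the full influence of $s^{(t)}$ on $L$ through all paths, whereas $\frac{\partial s^{(t)}}{\partial W}$ and $\frac{\partial s^{(t)}}{\partial b}$ denote only the immediate effect of the parameter on $s^{(t)}$ at time step $t$.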
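The backward recursion and the parameter gradients above can be checked numerically. The following is a minimal sketch, assuming an all-zero initial state and small made-up parameters; it computes $\nabla_b L$ and $\nabla_W L$ by back-propagation through time and compares $\nabla_b L$ with central finite differences. All function names are hypothetical.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vecmat(v, M):
    """Row vector times matrix: (v M)_j = sum_i v_i M_ij."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def softmax(o):
    m = max(o)
    e = [math.exp(v - m) for v in o]
    z = sum(e)
    return [v / z for v in e]

def forward(xs, ys, U, V, W, b, c):
    """Return states (s[0] is the zero initial state), output probabilities, and loss."""
    s, ps, loss = [[0.0] * len(b)], [], 0.0
    for x, y in zip(xs, ys):
        a = [bi + ws + ux for bi, ws, ux in zip(b, matvec(W, s[-1]), matvec(U, x))]
        s.append([math.tanh(ai) for ai in a])
        p = softmax([ci + vs for ci, vs in zip(c, matvec(V, s[-1]))])
        ps.append(p)
        loss -= math.log(p[y])
    return s, ps, loss

def bptt(xs, ys, U, V, W, b, c):
    """Gradients of the loss with respect to b and W."""
    s, ps, _ = forward(xs, ys, U, V, W, b, c)
    n, T = len(b), len(xs)
    grad_b = [0.0] * n
    grad_W = [[0.0] * n for _ in range(n)]
    grad_s_next = None                      # no s^{(T+1)} beyond the last step
    for t in range(T, 0, -1):               # s[t] is the state at step t
        grad_o = [p - (1.0 if i == ys[t - 1] else 0.0) for i, p in enumerate(ps[t - 1])]
        grad_s = vecmat(grad_o, V)          # path through o^{(t)}
        if grad_s_next is not None:         # path through s^{(t+1)}
            back = [g * (1.0 - si ** 2) for g, si in zip(grad_s_next, s[t + 1])]
            grad_s = [gs + bk for gs, bk in zip(grad_s, vecmat(back, W))]
        grad_a = [g * (1.0 - si ** 2) for g, si in zip(grad_s, s[t])]
        for i in range(n):
            grad_b[i] += grad_a[i]
            for j in range(n):
                grad_W[i][j] += grad_a[i] * s[t - 1][j]
        grad_s_next = grad_s
    return grad_b, grad_W

def numeric_grad_b(xs, ys, U, V, W, b, c, eps=1e-6):
    """Central finite differences on each component of b."""
    grads = []
    for i in range(len(b)):
        bp, bm = list(b), list(b)
        bp[i] += eps
        bm[i] -= eps
        lp = forward(xs, ys, U, V, W, bp, c)[2]
        lm = forward(xs, ys, U, V, W, bm, c)[2]
        grads.append((lp - lm) / (2 * eps))
    return grads

U = [[0.5, -0.3], [0.1, 0.8]]
W = [[0.2, -0.4], [0.7, 0.1]]
V = [[1.0, -1.0], [0.5, 0.3], [-0.2, 0.9]]
b, c = [0.1, -0.2], [0.0, 0.1, -0.1]
xs, ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0, 2, 1]

grad_b, grad_W = bptt(xs, ys, U, V, W, b, c)
gap = max(abs(a - n) for a, n in zip(grad_b, numeric_grad_b(xs, ys, U, V, W, b, c)))
print(gap)  # should be tiny if the recursion is implemented consistently
```

The same finite-difference comparison can be applied to the entries of `grad_W`; gradients for $U$, $V$, and $c$ are omitted here to keep the sketch short.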
For a sequence of random variables $Y = (y^{(1)}, y^{(2)}, \ldots, y^{(T)})$, the joint distribution can be decomposed with the chain rule:

$$P(Y) = P(y^{(1)}, \ldots, y^{(T)}) = \prod_{t=1}^{T} P(y^{(t)} \mid y^{(t-1)}, y^{(t-2)}, \ldots, y^{(1)}),$$

where the conditioning set is empty for $t = 1$.
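The chain-rule factorization can be illustrated with a toy conditional model. The function `conditional` below is a hypothetical stand-in for $P(y^{(t)} \mid y^{(t-1)}, \ldots, y^{(1)})$ over binary values, chosen only so that the probabilities are easy to check by hand.

```python
from itertools import product

def conditional(y, history):
    """Hypothetical P(y^{(t)} = y | history) for binary y.
    The probability of a 1 grows with the number of 1s seen so far."""
    p_one = (1 + sum(history)) / (2 + len(history))
    return p_one if y == 1 else 1.0 - p_one

def sequence_prob(ys):
    """P(Y) as the product over t of P(y^{(t)} | y^{(t-1)}, ..., y^{(1)})."""
    prob = 1.0
    for t, y in enumerate(ys):
        prob *= conditional(y, ys[:t])   # empty history when t is the first step
    return prob

# Summing the factorized probability over all length-4 sequences gives 1.
total = sum(sequence_prob(ys) for ys in product([0, 1], repeat=4))
print(total)
```

Because each conditional is a valid distribution, the product is automatically a valid joint distribution, which is what the final sum checks.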
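[Figure: graphical model over $y^{(1)}, y^{(2)}, y^{(3)}, y^{(4)}, y^{(5)}, \ldots$ for a sequence $y^{(1)}, y^{(2)}, \ldots, y^{(t)}, \ldots$, in which a past observation $y^{(\tau)}$ may influence $y^{(t)}$ for any $t > \tau$.]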
$$L = \sum_t L^{(t)}$$

$$L^{(t)} = -\log P(\mathrm{y}^{(t)} = y^{(t)} \mid y^{(t-1)}, y^{(t-2)}, \ldots, y^{(1)}).$$
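[Figure: graphical model over $y^{(1)}, y^{(2)}, y^{(3)}, y^{(4)}, y^{(5)}, \ldots$ with an added chain of state variables $s^{(1)}, s^{(2)}, s^{(3)}, s^{(4)}, s^{(5)}, \ldots$.]

The state $s^{(t)}$ summarizes the past sequence $(y^{(1)}, \ldots, y^{(t-1)})$ for the purpose of predicting $y^{(t)}$, and it is computed from $s^{(t-1)}$. This contrasts with a model of order $k$, which replaces the conditional on the entire history (time steps $t-1, t-2, \ldots, 1$) by a conditional on only the $k$ most recent values (time steps $t-1, \ldots, t-k$). In the recurrent model the history-dependent function $g^{(t)}$ is obtained by repeated application of a single function $f$ with parameters $\theta$.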
$$X = (x^{(1)}, x^{(2)}, \ldots, x^{(T)})$$

$$P(X) = \prod_{t=1}^{T} P\big(x^{(t)} \mid g^{(t)}(x^{(t-1)}, x^{(t-2)}, \ldots, x^{(1)})\big)$$
$$s^{(t)} = g^{(t)}(x^{(t)}, x^{(t-1)}, x^{(t-2)}, \ldots, x^{(2)}, x^{(1)}) = f_\theta(s^{(t-1)}, x^{(t)}).$$
When the sequence length $T$ is itself a random variable, the joint distribution decomposes as

$$P(x^{(1)}, \ldots, x^{(T)}) = P(T) \, P(x^{(1)}, \ldots, x^{(T)} \mid T).$$
A model $P(y \mid \omega)$ with parameter $\omega$ can be turned into a conditional model of $y$ given $x$ by making $\omega$ a function of $x$:

$$P(y \mid \omega = f(x)).$$
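[Figure: a recurrent network with states $s_{t-1}, s_t, s_{t+1}$, outputs $o_{t-1}, o_t, o_{t+1}$, losses $L_{t-1}, L_t, L_{t+1}$, and targets $y_{t-1}, y_t, y_{t+1}, y_{t+2}$, with weights $U$, $V$, $W$ and an additional weight matrix $R$ applied to the input $x$.]

[Figure: a recurrent network with inputs $x^{(t-1)}, x^{(t)}, x^{(t+1)}$, states $s^{(t-1)}, s^{(t)}, s^{(t+1)}$, outputs $o^{(t-1)}, o^{(t)}, o^{(t+1)}$, losses $L^{(t-1)}, L^{(t)}, L^{(t+1)}$, and targets $y^{(t-1)}, y^{(t)}, y^{(t+1)}$, with weights $U$, $V$, $W$ and an additional weight matrix $R$ at each step.]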
Such a model assumes that, given the inputs, $y^{(t)}$ is conditionally independent of the earlier outputs:

$$P(y^{(t)} \mid y^{(t-1)}, \ldots, y^{(1)}, x) = P(y^{(t)} \mid x^{(t)}, x^{(t-1)}, \ldots, x^{(1)})$$
[Figure: bidirectional recurrent network mapping inputs $x$ to targets $y$, shown for time steps $t-1$, $t$, $t+1$, with inputs $x^{(t)}$, two sets of recurrent states $h^{(t)}$ and $g^{(t)}$, outputs $o^{(t)}$, targets $y^{(t)}$, and losses $L^{(t)}$. The output $o^{(t)}$ at time $t$ depends on both $h^{(t)}$ and $g^{(t)}$.]

For two-dimensional inputs, the output at position $(i, j)$ is written $o_{i,j}$.
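[Figure: encoder-decoder architecture. The encoder reads the input sequence $(x^{(1)}, x^{(2)}, \ldots, x^{(n_x)})$ and emits a context $C$; the decoder generates the output sequence $(y^{(1)}, \ldots, y^{(n_y)})$ from $C$.]

The input sequence is $X = (x^{(1)}, \ldots, x^{(n_x)})$ and the output sequence is $Y = (y^{(1)}, \ldots, y^{(n_y)})$, where the lengths $n_x$ and $n_y$ may differ. Training maximizes $\log P(\mathrm{Y} = Y \mid \mathrm{X} = X)$ over the pairs $(X, Y)$ in the training set. The context $C$ is typically taken from the last encoder state $s_{n_x}$.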
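[Figure: three panels, each with input $x_t$, hidden states $h_{t-1}$ and $h_t$, and output $y_t$; one panel adds a second set of states $z_{t-1}$, $z_t$.]

[Figure: a tree-structured (recursive) network that combines the inputs $x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}$ into an output $o$, with target $y$ and loss $L$, reusing the weights $U$, $V$, $W$ at every node.]

For a sequence of length $N$, the depth of such a network can be reduced from $N$ to $O(\log N)$.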
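The state $s^{(t)}$ at time $t$ summarizes the inputs $x^{(1)}, x^{(2)}, \ldots, x^{(t)}$. The Jacobian of the state at time $t$ with respect to the previous state is

$$J^{(t)} = \frac{\partial s^{(t)}}{\partial s^{(t-1)}}.$$

Suppose the Jacobian $J^{(t)}$ is the same matrix $J$ for every $t$, and that $J$ has an eigenvector $v$ with eigenvalue $\lambda$. A back-propagated gradient $g$ becomes $Jg$ after one step and $J^n g$ after $n$ steps. If we instead start from the perturbed gradient $g + \delta v$, we obtain $J(g + \delta v)$ after one step and $J^n (g + \delta v)$ after $n$ steps. Back-propagation from $g$ and from $g + \delta v$ therefore differ by $\delta J^n v$ after $n$ steps. When $v$ is a unit eigenvector of $J$ with eigenvalue $\lambda$, the two runs are separated by a distance of $\delta |\lambda|^n$. Choosing the $v$ with the largest $|\lambda|$ gives the widest possible separation for an initial perturbation of size $\delta$. When $|\lambda| > 1$, the deviation $\delta |\lambda|^n$ grows exponentially; when $|\lambda| < 1$, it shrinks exponentially.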
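A small numeric illustration of this argument, assuming a hypothetical diagonal Jacobian whose eigenvalues are 1.1 and 0.9 (so the eigenvectors are the coordinate axes): a perturbation of size $\delta$ along each eigenvector is propagated for $n$ steps and compared with $\delta |\lambda|^n$.

```python
def backprop_steps(J, g, n):
    """Multiply the gradient vector g by the Jacobian J, n times."""
    for _ in range(n):
        g = [sum(Jij * gj for Jij, gj in zip(row, g)) for row in J]
    return g

def distance(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Assumed Jacobian: eigenvalue 1.1 along the first axis, 0.9 along the second.
J = [[1.1, 0.0], [0.0, 0.9]]
g = [1.0, 1.0]
delta, n = 0.01, 50

base = backprop_steps(J, g, n)
grow = distance(backprop_steps(J, [g[0] + delta, g[1]], n), base)    # |lambda| > 1
shrink = distance(backprop_steps(J, [g[0], g[1] + delta], n), base)  # |lambda| < 1

print(grow, delta * 1.1 ** n)
print(shrink, delta * 0.9 ** n)
```

In each printed pair the measured separation should agree with $\delta |\lambda|^n$ up to rounding: the first grows well beyond $\delta$, the second falls far below it.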
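Consider the purely linear recurrence, with $s$ written as a row vector:

$$s^{(t+1)} = s^{(t)} W.$$

Here the map from $s^{(t)}$ to $s^{(t+1)}$ is determined by $W$ alone, so the Jacobian $J$ is the same for every state $s^{(t)}$. With a $\tanh$ nonlinearity, each step also multiplies by the derivative of $\tanh$, which is at most 1, so $J$ depends on both $W$ and the current state as the network moves from $t$ to $t + 1$.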
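[Figure: a recurrent network with input $x$, state $s$, and output $o$, unfolded over inputs $x_{t-1}, x_t, x_{t+1}$, states $s_{t-2}, s_{t-1}, s_t, s_{t+1}$, and outputs $o_{t-1}, o_t, o_{t+1}$, with one-step recurrent weights $W_1$ and longer-delay skip weights $W_3$.]

With recurrent connections that skip $d$ time steps, gradients that would vanish or explode as $O(\lambda^T)$ over $T$ time steps, where $\lambda$ is the largest eigenvalue of the Jacobians $\frac{\partial s^{(t)}}{\partial s^{(t-1)}}$, instead do so as $O(\lambda^{T/d})$ along the skip paths, because the number of effective steps is $T/d$.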
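$$\dot{s}_i \tau_i = -s_i + \sigma(b_i + W_{i,:} s + U_{i,:} x)$$

where $\sigma$ is the neural nonlinearity, $\tau_i > 0$ is a time constant, and $\dot{s}_i$ is the temporal derivative of $s_i$. A related equation is

$$\dot{s}_i \tau_i = -s_i + (b_i + W_{i,:} \sigma(s) + U_{i,:} x)$$

where $s$ is the vector with elements $s_i$. Discretizing the first equation in time gives

$$s_i^{(t+1)} - s_i^{(t)} = -\frac{s_i^{(t)}}{\tau_i} + \frac{1}{\tau_i} \sigma(b_i + W_{i,:} s^{(t)} + U_{i,:} x^{(t)})$$

$$s_i^{(t+1)} = \Big(1 - \frac{1}{\tau_i}\Big) s_i^{(t)} + \frac{1}{\tau_i} \sigma(b_i + W_{i,:} s^{(t)} + U_{i,:} x^{(t)}),$$

with $1 \le \tau_i < \infty$. When $\tau_i = 1$ there is no linear self-recurrence, only the nonlinear update of an ordinary recurrent network. When $\tau_i > 1$, the unit keeps a fraction $(1 - \frac{1}{\tau_i})$ of its previous state, and the new contribution $\sigma(b_i + W s^{(t)} + U x^{(t)})$ is scaled by $\frac{1}{\tau_i}$, so larger $\tau_i$ means the state changes more slowly.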
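The discretized leaky-unit update can be sketched directly. The logistic sigmoid is assumed here for the nonlinearity $\sigma$, which the text leaves generic, and `leaky_step` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaky_step(s, x, W, U, b, tau):
    """s_i <- (1 - 1/tau_i) * s_i + (1/tau_i) * sigma(b_i + W[i,:] s + U[i,:] x)."""
    new_s = []
    for i in range(len(s)):
        pre = b[i] + sum(w * sj for w, sj in zip(W[i], s)) + sum(u * xj for u, xj in zip(U[i], x))
        new_s.append((1.0 - 1.0 / tau[i]) * s[i] + sigmoid(pre) / tau[i])
    return new_s

# Two units with identical (zero) weights but different time constants.
W = [[0.0, 0.0], [0.0, 0.0]]
U = [[0.0], [0.0]]
b = [0.0, 0.0]
tau = [1.0, 10.0]

out = leaky_step([0.2, 0.2], [0.0], W, U, b, tau)
print(out)  # unit 0 jumps to sigma(0); unit 1 moves only a tenth of the way
```

With `tau = 1` the old state is discarded entirely, while with `tau = 10` the unit retains nine tenths of its previous value, which is how larger time constants carry information over longer spans.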
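[Figure: LSTM cell with an input, an input gate, a forget gate, an output gate, an output, and a state unit with a self-loop. The gates act multiplicatively ($\times$) and the state accumulates additively ($+$).]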
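The self-loop weight of the state unit $s_i^{(t)}$ is controlled by a forget gate unit $f_i^{(t)}$ (for time step $t$ and cell $i$):

$$f_i^{(t)} = \mathrm{sigmoid}\Big(b_i^{f} + \sum_j U_{ij}^{f} x_j^{(t)} + \sum_j W_{ij}^{f} h_j^{(t)}\Big),$$

where $x^{(t)}$ is the current input vector, $h^{(t)}$ is the current hidden layer vector, and $b^{f}$, $U^{f}$, $W^{f}$ are the biases, input weights, and recurrent weights of the forget gates. The internal state is updated with the conditional self-loop weight $f_i^{(t)}$:

$$s_i^{(t+1)} = f_i^{(t)} s_i^{(t)} + g_i^{(t)} \sigma\Big(b_i + \sum_j U_{ij} x_j^{(t)} + \sum_j W_{ij} h_j^{(t)}\Big),$$

where $b$, $U$, and $W$ are the biases, input weights, and recurrent weights into the cell. The external input gate unit $g_i^{(t)}$ is computed like the forget gate, with its own parameters:

$$g_i^{(t)} = \mathrm{sigmoid}\Big(b_i^{e} + \sum_j U_{ij}^{e} x_j^{(t)} + \sum_j W_{ij}^{e} h_j^{(t)}\Big).$$

The output $h_i^{(t+1)}$ of the cell can be shut off by the output gate $q_i^{(t)}$:

$$h_i^{(t+1)} = \tanh\big(s_i^{(t+1)}\big) \, q_i^{(t)}$$

$$q_i^{(t)} = \mathrm{sigmoid}\Big(b_i^{o} + \sum_j U_{ij}^{o} x_j^{(t)} + \sum_j W_{ij}^{o} h_j^{(t)}\Big),$$

with parameters $b^{o}$, $U^{o}$, $W^{o}$ for its biases, input weights, and recurrent weights. Some variants also feed the cell state $s_i^{(t)}$ as an extra input to the gates of unit $i$.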
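A single step of these LSTM equations might look as follows. This is a sketch under stated assumptions: the cell nonlinearity $\sigma$ is taken to be $\tanh$, the external input gate is called `g`, and the parameter dictionary layout and helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(params, x, h):
    """b_i + sum_j U_ij x_j + sum_j W_ij h_j for every unit i."""
    b, U, W = params
    return [b[i] + sum(u * xj for u, xj in zip(U[i], x)) + sum(w * hj for w, hj in zip(W[i], h))
            for i in range(len(b))]

def lstm_step(s, h, x, p):
    """One LSTM update; p maps 'forget', 'input', 'output', 'cell' to (b, U, W)."""
    f = [sigmoid(z) for z in affine(p["forget"], x, h)]    # forget gate
    g = [sigmoid(z) for z in affine(p["input"], x, h)]     # external input gate
    q = [sigmoid(z) for z in affine(p["output"], x, h)]    # output gate
    cand = [math.tanh(z) for z in affine(p["cell"], x, h)] # assumed tanh for sigma
    s_new = [fi * si + gi * ci for fi, si, gi, ci in zip(f, s, g, cand)]
    h_new = [math.tanh(si) * qi for si, qi in zip(s_new, q)]
    return s_new, h_new

def zeros(n_units, n_in):
    """All-zero (b, U, W) triple, used only to make the example easy to follow."""
    return ([0.0] * n_units,
            [[0.0] * n_in for _ in range(n_units)],
            [[0.0] * n_units for _ in range(n_units)])

p = {name: zeros(1, 1) for name in ("forget", "input", "output", "cell")}
s_new, h_new = lstm_step([2.0], [0.0], [0.0], p)
print(s_new, h_new)  # every gate is 0.5 here, so the state is halved
```

With all parameters at zero every gate equals 0.5, so half of the old state survives. Pushing the forget-gate bias up and the input-gate bias down makes the cell copy its state almost unchanged, which is the long-memory regime the gated self-loop is designed for.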
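$$h_i^{(t+1)} = u_i^{(t)} h_i^{(t)} + \big(1 - u_i^{(t)}\big) \, \sigma\Big(b_i + \sum_j U_{ij} x_j^{(t)} + \sum_j W_{ij} r_j^{(t)} h_j^{(t)}\Big),$$

where $u$ is the update gate and $r$ is the reset gate:

$$u_i^{(t)} = \mathrm{sigmoid}\Big(b_i^{u} + \sum_j U_{ij}^{u} x_j^{(t)} + \sum_j W_{ij}^{u} h_j^{(t)}\Big)$$

$$r_i^{(t)} = \mathrm{sigmoid}\Big(b_i^{r} + \sum_j U_{ij}^{r} x_j^{(t)} + \sum_j W_{ij}^{r} h_j^{(t)}\Big).$$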
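The gated update above can be sketched in the same style. The candidate nonlinearity $\sigma$ is assumed to be $\tanh$, and the parameter layout and helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(params, x, h):
    """b_i + sum_j U_ij x_j + sum_j W_ij h_j for every unit i."""
    b, U, W = params
    return [b[i] + sum(u * xj for u, xj in zip(U[i], x)) + sum(w * hj for w, hj in zip(W[i], h))
            for i in range(len(b))]

def gru_step(h, x, p):
    """One gated update; p maps 'update', 'reset', 'cand' to (b, U, W)."""
    u = [sigmoid(z) for z in affine(p["update"], x, h)]  # update gate
    r = [sigmoid(z) for z in affine(p["reset"], x, h)]   # reset gate
    gated_h = [ri * hi for ri, hi in zip(r, h)]          # r_j * h_j inside the sum
    cand = [math.tanh(z) for z in affine(p["cand"], x, gated_h)]
    return [ui * hi + (1.0 - ui) * ci for ui, hi, ci in zip(u, h, cand)]

def zeros(n_units, n_in):
    """All-zero (b, U, W) triple for a tiny worked example."""
    return ([0.0] * n_units,
            [[0.0] * n_in for _ in range(n_units)],
            [[0.0] * n_units for _ in range(n_units)])

p = {name: zeros(1, 1) for name in ("update", "reset", "cand")}
h_new = gru_step([0.4], [0.0], p)
print(h_new)  # update gate is 0.5 and the candidate is 0, so h is halved
```

A single gate `u` plays both roles here: it decides how much of the old state to keep and, through `1 - u`, how much of the candidate to admit.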
[Figure: a task network controlling a memory. The network interacts with the memory cells through a writing mechanism and a reading mechanism.]
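One option is to clip the norm $\|g\|$ of the gradient $g$ just before the parameter update:

$$\text{if } \|g\| > v: \qquad g \leftarrow \frac{g \, v}{\|g\|},$$

where $v$ is the norm threshold and $g$ is the gradient used to update the parameters.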
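The clipping rule can be written in a few lines. This sketch treats the gradient as a flat list of floats; `clip_by_norm` is a hypothetical helper name.

```python
import math

def clip_by_norm(g, v):
    """Rescale g to norm v if its Euclidean norm exceeds v; otherwise return a copy."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm > v:
        return [x * v / norm for x in g]
    return list(g)

print(clip_by_norm([3.0, 4.0], 1.0))  # norm 5 exceeds the threshold, so it is rescaled
print(clip_by_norm([0.3, 0.4], 1.0))  # norm 0.5 is below the threshold, so it is unchanged
```

Rescaling the whole vector keeps the direction of the update and bounds only its length, unlike clipping each coordinate separately.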
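Given the gradient $\nabla_{s^{(t)}} L$ being back-propagated, we would like

$$\big(\nabla_{s^{(t)}} L\big) \frac{\partial s^{(t)}}{\partial s^{(t-1)}}$$

to be as large as $\nabla_{s^{(t)}} L$. This objective can be encouraged with the regularizer

$$\Omega = \sum_t \left( \frac{\Big\| \big(\nabla_{s^{(t)}} L\big) \frac{\partial s^{(t)}}{\partial s^{(t-1)}} \Big\|}{\big\| \nabla_{s^{(t)}} L \big\|} - 1 \right)^2.$$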
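To make the regularizer concrete, the sketch below evaluates $\Omega$ for given gradient row vectors and state-to-state Jacobians. The inputs are made-up values, and in a real model both would come from back-propagation; `omega` is a hypothetical helper name.

```python
import math

def vecmat(v, M):
    """Row vector times matrix: (v M)_j = sum_i v_i M_ij."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def omega(grads, jacobians):
    """Sum over t of (||grad_t J_t|| / ||grad_t|| - 1)^2."""
    total = 0.0
    for g, J in zip(grads, jacobians):
        ratio = norm(vecmat(g, J)) / norm(g)
        total += (ratio - 1.0) ** 2
    return total

identity = [[1.0, 0.0], [0.0, 1.0]]
half = [[0.5, 0.0], [0.0, 0.5]]
grads = [[3.0, 4.0], [1.0, 0.0]]

print(omega(grads, [identity, identity]))  # norm preserved at every step
print(omega(grads, [half, half]))          # norm halved at every step
```

The penalty is zero exactly when each Jacobian preserves the norm of the gradient passing through it, and it grows whether the gradient shrinks or expands.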